For data \(\{(\boldsymbol{x}_i, y_i); i = 1, \ldots, N\}\) with \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\),
\[y_i = f(\boldsymbol{x}_i) + \varepsilon_i\]
We propose that the underlying function is drawn from a Gaussian process,
\[f(\cdot) \sim\mathcal{GP} \left( \mu(\boldsymbol{x};\boldsymbol\theta_\mu), k(\boldsymbol{x}, \boldsymbol{x'}; \boldsymbol{\theta}_k) \right)\]
where \(\mu(\cdot)\) is the mean function and \(k(\cdot)\) is the covariance kernel function, with hyperparameters \(\boldsymbol\theta_\mu\) and \(\boldsymbol\theta_k\), respectively.
If we were to take many realisations of a GP, the mean of these over the support would be the specified mean function.
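This can be checked numerically. The sketch below, a hypothetical illustration assuming a zero mean function and a squared-exponential kernel, draws many realisations from the prior and confirms that their empirical mean is close to the specified mean function:

```python
import numpy as np

def sq_exp_kernel(x1, x2):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2)."""
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
mu = np.zeros_like(x)                # mean function mu(x) = 0
K = sq_exp_kernel(x, x)              # covariance matrix over the support
jitter = 1e-9 * np.eye(len(x))       # small diagonal term for numerical stability
draws = rng.multivariate_normal(mu, K + jitter, size=2000)

# The empirical mean over many realisations approaches the mean function.
print(np.abs(draws.mean(axis=0)).max())   # close to 0
```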
For example, take \(\mu(\boldsymbol{x}) = \boldsymbol{0}\) and \(k(\boldsymbol{x}, \boldsymbol{x'}) = \exp\{-\|\boldsymbol{x} - \boldsymbol{x'}\|^2\}\). The observed and target values are then jointly Gaussian:
\[ \left[ {\begin{array}{c} \boldsymbol{y}\\ \boldsymbol{f^*}\\ \end{array} } \right] \sim \mathcal{N} \left(\left[ \begin{array}{c} \boldsymbol{\mu}\\ \boldsymbol{\mu^*} \\ \end{array} \right], \left[ \begin{array}{cc} k(\boldsymbol{x}, \boldsymbol{x}) + C& k(\boldsymbol{x}, \boldsymbol{x^*})\\ k(\boldsymbol{x^*}, \boldsymbol{x}) & k(\boldsymbol{x^*}, \boldsymbol{x^*}) \\\end{array}\right] \right) \]
where \(\boldsymbol{x}\) are observed points whose values are \(\boldsymbol{y}\), \(\boldsymbol{x^*}\) are target points with predicted values \(\boldsymbol{f^*}\), and \(C\) is the covariance of the observation noise \(\varepsilon\) (e.g. \(\sigma_n^2 I\) for i.i.d. noise).
The reconstruction \(\boldsymbol{f^*}\) depends on the choice of \(\boldsymbol{\mu^*}\) and \(k(\cdot)\), which themselves have hyperparameters \(\boldsymbol{\theta}\).
We actually want the reconstruction conditioned on the observations \(\boldsymbol{y}\):
\[\boldsymbol{f^*} \vert \; \boldsymbol{y} \sim \mathcal{MVN}(\boldsymbol{\bar{f}^*}, \mathrm{Cov}(\boldsymbol{f^*}))\]
Standard Gaussian conditioning gives
\[\boldsymbol{\bar{f}^*} = \boldsymbol{\mu^*} + k(\boldsymbol{x^*}, \boldsymbol{x})\left[k(\boldsymbol{x}, \boldsymbol{x}) + C\right]^{-1}(\boldsymbol{y} - \boldsymbol{\mu})\]
\[\mathrm{Cov}(\boldsymbol{f^*}) = k(\boldsymbol{x^*}, \boldsymbol{x^*}) - k(\boldsymbol{x^*}, \boldsymbol{x})\left[k(\boldsymbol{x}, \boldsymbol{x}) + C\right]^{-1}k(\boldsymbol{x}, \boldsymbol{x^*})\]
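The conditioning can be sketched in a few lines of numpy. This is an illustration, not a full implementation: it assumes a zero mean function, a squared-exponential kernel with unit hyperparameters, and i.i.d. noise \(C = \sigma_n^2 I\) with known \(\sigma_n\); the data are synthetic observations of \(\sin(x)\).

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel with signal scale sigma_f and length scale ell."""
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x, y, x_star, sigma_n=0.1):
    """Condition the joint Gaussian on observations y (zero mean function)."""
    K = sq_exp_kernel(x, x) + sigma_n**2 * np.eye(len(x))   # k(x, x) + C
    K_s = sq_exp_kernel(x, x_star)                          # k(x, x*)
    K_ss = sq_exp_kernel(x_star, x_star)                    # k(x*, x*)
    f_bar = K_s.T @ np.linalg.solve(K, y)                   # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)            # posterior covariance
    return f_bar, cov

# Noisy observations of sin(x), predicted on a finer grid
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))
x_star = np.linspace(0.0, 2.0 * np.pi, 100)
f_bar, cov = gp_posterior(x, y, x_star)
```

The diagonal of `cov` gives pointwise predictive variances, which shrink near the observed points.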
Joint probabilities can be expressed as conditional probabilities:
\[P(A, B) = P(A|B) \times P(B)\]
\[P(A,B | C) = P(A|B,C) \times P(B|C)\]
Variables can be “integrated out”:
\[\begin{aligned} P(X) &= \int_y P(X, Y = y) \,\mathrm{d}y\\ &= \int_y P(X \vert Y = y) \times P(Y = y) \,\mathrm{d}y \end{aligned}\]
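A quick sanity check of the marginalisation identity, using a made-up discrete joint distribution over two binary variables (sums replace the integrals):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y); rows index X, columns index Y.
P_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])

P_y = P_xy.sum(axis=0)                 # P(Y), summing out X
P_x_given_y = P_xy / P_y               # P(X | Y), column-wise
P_x = (P_x_given_y * P_y).sum(axis=1)  # sum_y P(X | Y = y) P(Y = y)

print(P_x, P_xy.sum(axis=1))           # both give the marginal P(X)
```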
The marginal likelihood of the observations is obtained by integrating out \(\boldsymbol{f}\):
\[p(\boldsymbol{y} | \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) = \int p(\boldsymbol{y}|\boldsymbol{f}, \boldsymbol{x}) \times p(\boldsymbol{f}|\boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) \;d\boldsymbol{f}\]
We can choose values of \(\boldsymbol\mu\) and \(\boldsymbol\theta_k\) that maximise this quantity to obtain the “best fit” Gaussian process.
There are sometimes analytical solutions for this; in particular, for Gaussian observation noise the integral has a closed form.
Apply GP regression with
\[k(x, x'; \sigma_f, \ell) = \sigma_f^2 \exp\left\{-\frac{1}{2}\left( \frac{x - x'}{\ell}\right)^2\right\}\qquad \sigma_f, \ell \gt 0\]
Find the “best” hyperparameter settings by optimising the marginal likelihood. \[p(\boldsymbol{y} | \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) = \int p(\boldsymbol{y}|\boldsymbol{f}, \boldsymbol{x}) \times p(\boldsymbol{f}|\boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) \;d\boldsymbol{f}\]
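A minimal optimisation sketch, under the assumptions that the mean function is zero, the noise is i.i.d. Gaussian with known \(\sigma_n = 0.1\), and the data are synthetic. For Gaussian noise the marginal likelihood has the closed form \(\log p(\boldsymbol{y}) = -\tfrac{1}{2}\boldsymbol{y}^\top K^{-1}\boldsymbol{y} - \tfrac{1}{2}\log|K| - \tfrac{N}{2}\log 2\pi\) with \(K = k(\boldsymbol{x},\boldsymbol{x}) + \sigma_n^2 I\); working in log space enforces \(\sigma_f, \ell > 0\).

```python
import numpy as np
from scipy.optimize import minimize

def sq_exp_kernel(x1, x2, sigma_f, ell):
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

def neg_log_marginal_likelihood(log_params, x, y, sigma_n=0.1):
    """Negative log marginal likelihood; parameters passed in log space."""
    sigma_f, ell = np.exp(log_params)
    K = sq_exp_kernel(x, x, sigma_f, ell) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^{-1} y + 0.5 log|K| + (N/2) log 2 pi
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0]), args=(x, y))
sigma_f_hat, ell_hat = np.exp(res.x)
print(sigma_f_hat, ell_hat)
```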
Alternatively, marginalise over the hyperparameters, under a prior \(p(\boldsymbol\theta_k)\), to remove their effects. \[p(\boldsymbol{y} \vert \boldsymbol{x}, \boldsymbol{\mu}) = \iint p(\boldsymbol{y} \vert \boldsymbol{f}, \boldsymbol{x})\, p(\boldsymbol{f}\vert \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k)\, \mathrm{d}\boldsymbol{f}\,\mathrm{d}\boldsymbol\theta_k\]
Marginalising over the mean function as well, under a prior \(p(\boldsymbol\mu)\), \[p(\boldsymbol{y} \vert \boldsymbol{x}) = \iiint p(\boldsymbol{y} \vert \boldsymbol{f}, \boldsymbol{x})\, p(\boldsymbol{f}\vert \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k)\, p(\boldsymbol{\mu})\, \mathrm{d}\boldsymbol{f}\,\mathrm{d}\boldsymbol\theta_k\,\mathrm{d}\boldsymbol\mu\]
These integrals are rarely tractable analytically and are typically approximated, for example by MCMC.